Spatial Cognition


SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

Xu, Peiran, Wang, Sudong, Zhu, Yao, Li, Jianing, Zhang, Yunjian

arXiv.org Artificial Intelligence

Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric, which fails to capture the hierarchical structure and interdependence of spatial abilities. To address this gap, we propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels from basic observation to high-level planning. Building upon this taxonomy, we construct SpatialBench, a large-scale, fine-grained benchmark covering 15 tasks aligned with these cognitive levels. To provide a unified evaluation across heterogeneous tasks, we further introduce a high-level capability-oriented metric that reliably assesses a model's overall spatial reasoning ability. Extensive experiments across a broad set of MLLMs reveal distinct performance stratification across cognitive levels: models exhibit strong perceptual grounding yet remain limited in symbolic reasoning, causal inference, and planning. Additional human tests demonstrate that humans perform selective, goal-directed abstraction, while MLLMs tend to over-attend to surface details without coherent spatial intent. Our work establishes the first systematic framework for measuring hierarchical spatial cognition in MLLMs, laying the foundation for future spatially intelligent systems.


Solving Spatial Supersensing Without Spatial Supersensing

Udandarao, Vishaal, Karthik, Shyamgopal, Nath, Surabhi S., Hochlehnert, Andreas, Bethge, Matthias, Prabhu, Ameya

arXiv.org Artificial Intelligence

Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows benchmarks like VSR can be nearly solved without spatial cognition, world modeling or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: We concatenate each video with itself 1-5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, Cambrian-S's inference algorithm relies largely on a shortcut in the VSC benchmark: rooms are never revisited. Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than from robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: https://github.com/bethgelab/supersanity
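The VSC-Repeat check described above rests on a simple invariant: repeating a video does not change how many unique objects it contains, so a robust counter's prediction should not change either. A minimal sketch of that invariant, using an illustrative stand-in for the detection pipeline (not the actual Cambrian-S or NoSense code):

```python
# Hypothetical sketch of the VSC-Repeat perturbation: each "frame" here
# is already a set of object ids, standing in for a real detector's output.

def count_unique_objects(video):
    # A counter that integrates across views tallies each object once.
    seen = set()
    for frame in video:
        seen |= frame
    return len(seen)

def vsc_repeat(video, k):
    # Concatenate the video with itself k times (k=1 is the original).
    return video * k

video = [{"chair", "lamp"}, {"lamp", "sofa"}, {"plant"}]
base = count_unique_objects(video)  # 4 unique objects
for k in range(1, 6):
    # The invariant a supersensing system should satisfy:
    assert count_unique_objects(vsc_repeat(video, k)) == base
```

A system that instead assumes rooms are never revisited would count each repetition as new objects, which is the shortcut the perturbation exposes.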


RynnEC: Bringing MLLMs into Embodied World

Dang, Ronghao, Yuan, Yuqian, Mao, Yunxuan, Li, Kehan, Liu, Jiangpin, Wang, Zhikai, Li, Xin, Wang, Fan, Zhao, Deli

arXiv.org Artificial Intelligence

We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general-purpose vision-language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region-centric video paradigm for the brain of embodied agents, providing fine-grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric-video-based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general-purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at: https://github.com/alibaba-damo-academy/RynnEC


A Preliminary Exploration of the Differences and Conjunction of Traditional PNT and Brain-inspired PNT

He, Xu, Meng, Xiaolin, Yin, Wenxuan, Zhang, Youdong, Mo, Lingfei, An, Xiangdong, Yu, Fangwen, Pan, Shuguo, Liu, Yufeng, Liu, Jingnan, Zhang, Yujia, Gao, Wang

arXiv.org Artificial Intelligence

Developing universal Positioning, Navigation, and Timing (PNT) is our enduring goal. Today's complex environments demand PNT that is more resilient, energy-efficient and cognitively capable. This paper asks how we can endow unmanned systems with brain-inspired spatial cognition navigation while exploiting the high precision of machine PNT to advance universal PNT. We provide a new perspective and roadmap for shifting PNT from "tool-oriented" to "cognition-driven". Contributions: (1) multi-level dissection of differences among traditional PNT, biological brain PNT and brain-inspired PNT; (2) a four-layer (observation-capability-decision-hardware) fusion framework that unites numerical precision and brain-inspired intelligence; (3) forward-looking recommendations for future development of brain-inspired PNT.

Keywords: Brain-inspired navigation, PNT, Differences and Conjunction, Fusion Framework

1. Introduction

Unmanned system Positioning, Navigation, and Timing (PNT) technologies have achieved numerous practical advances. Particularly noteworthy is the rapid maturation of Global Navigation Satellite System (GNSS)-based PNT, which has not only expanded its application domains but also driven down operational costs. However, these technologies still face formidable challenges in highly uncertain and complex scenarios, such as deep space, the deep ocean, polar regions, and dense urban environments.


11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis

Li, Chengzu, Wu, Wenshan, Zhang, Huanyu, Li, Qingtao, Gao, Zeyu, Xia, Yan, Hernández-Orallo, José, Vulić, Ivan, Wei, Furu

arXiv.org Artificial Intelligence

In human cognition, spatial reasoning and perception are closely entangled, yet the nature of this interplay remains underexplored in the evaluation of multimodal large language models (MLLMs). While recent MLLM advancements show impressive performance on reasoning, their capacity for human-like spatial cognition remains an open question. In this work, we introduce a systematic evaluation framework to assess the spatial reasoning abilities of state-of-the-art MLLMs relative to human performance. Central to our work is 11Plus-Bench, a high-quality benchmark derived from realistic standardized spatial aptitude tests. 11Plus-Bench also features fine-grained expert annotations of both perceptual complexity and reasoning process, enabling detailed instance-level analysis of model behavior. Through extensive experiments across 14 MLLMs and human evaluation, we find that current MLLMs exhibit early signs of spatial cognition. Despite a large performance gap compared to humans, MLLMs' cognitive profiles resemble those of humans in that cognitive effort correlates strongly with reasoning-related complexity. However, instance-level performance in MLLMs remains largely random, whereas human correctness is highly predictable and shaped by abstract pattern complexity. These findings highlight both emerging capabilities and limitations in current MLLMs' spatial reasoning capabilities and provide actionable insights for advancing model design.


Mimicking associative learning of rats via a neuromorphic robot in open field maze using spatial cell models

Liu, Tianze, Siddique, Md Abu Bakr, An, Hongyu

arXiv.org Artificial Intelligence

Data-driven Artificial Intelligence (AI) approaches have exhibited remarkable prowess across various cognitive tasks using extensive training data. However, the reliance on large datasets and neural networks presents challenges such as high power consumption and limited adaptability, particularly in SWaP-constrained applications like planetary exploration. To address these issues, we propose enhancing the autonomous capabilities of intelligent robots by emulating the associative learning observed in animals. Associative learning enables animals to adapt to their environment by memorizing concurrent events. By replicating this mechanism, neuromorphic robots can navigate dynamic environments autonomously, learning from interactions to optimize performance. This paper explores the emulation of associative learning in rodents using neuromorphic robots within open-field maze environments, leveraging insights from spatial cells such as place and grid cells. By integrating these models, we aim to enable online associative learning for spatial tasks in real-time scenarios, bridging the gap between biological spatial cognition and robotics for advancements in autonomous systems.
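The core mechanism here, memorizing concurrent events, is classically modeled with a Hebbian update: a connection between two units strengthens whenever both are active together. A minimal sketch under that assumption (the unit names, learning rate, and threshold are illustrative, not taken from the paper's spatial-cell models):

```python
# Minimal Hebbian sketch of associative learning: repeated pairings of a
# conditioned stimulus (pre) with a response-driving event (post) grow the
# weight until the stimulus alone crosses a response threshold.

def hebbian_update(w, pre, post, lr=0.2):
    # Hebb's rule: units that fire together wire together.
    return w + lr * pre * post

w = 0.0                       # association strength, initially absent
for _ in range(10):           # ten concurrent pairings of stimulus and event
    w = hebbian_update(w, pre=1.0, post=1.0)

responds = w > 0.5            # the stimulus alone now evokes the response
```

This is the simplest form of the mechanism; spatial-cell-based variants gate the same kind of update by place- or grid-cell activity so that associations become location-specific.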


Can LLMs Learn to Map the World from Local Descriptions?

Xia, Sirui, Chen, Aili, Wang, Xintao, Zhu, Tinghui, Zhang, Yikai, Chen, Jiangjie, Xiao, Yanghua

arXiv.org Artificial Intelligence

Recent advances in Large Language Models (LLMs) have demonstrated strong capabilities in tasks such as code and mathematics. However, their potential to internalize structured spatial knowledge remains underexplored. This study investigates whether LLMs, grounded in locally relative human observations, can construct coherent global spatial cognition by integrating fragmented relational descriptions. We focus on two core aspects of spatial cognition: spatial perception, where models infer consistent global layouts from local positional relationships, and spatial navigation, where models learn road connectivity from trajectory data and plan optimal paths between unconnected locations. Experiments conducted in a simulated urban environment demonstrate that LLMs not only generalize to unseen spatial relationships between points of interest (POIs) but also exhibit latent representations aligned with real-world spatial distributions. Furthermore, LLMs can learn road connectivity from trajectory descriptions, enabling accurate path planning and dynamic spatial awareness during navigation.
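The navigation setting described above can be sketched concretely: trajectories (sequences of visited POIs) induce a road graph, over which a path can be planned between two locations that never co-occur in any single trajectory. The POI names and graph construction below are illustrative assumptions, not the paper's actual data or method:

```python
# Sketch: induce road connectivity from trajectory descriptions, then plan
# a path between locations never directly connected in one trajectory.
from collections import defaultdict, deque

def build_graph(trajectories):
    # Each consecutive pair of POIs in a trajectory implies a road segment.
    graph = defaultdict(set)
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def plan_path(graph, start, goal):
    # Breadth-first search: shortest path by number of road segments.
    queue, prev = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None  # goal unreachable from start

trajs = [["station", "market", "park"], ["park", "museum"]]
graph = build_graph(trajs)
print(plan_path(graph, "station", "museum"))
# -> ['station', 'market', 'park', 'museum']
```

Note that "station" and "museum" never appear in the same trajectory; the planned path composes fragments from both, which is the kind of generalization the study probes in LLMs.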


A Survey of Large Language Model-Powered Spatial Intelligence Across Scales: Advances in Embodied Agents, Smart Cities, and Earth Science

Feng, Jie, Zeng, Jinwei, Long, Qingyue, Chen, Hongyi, Zhao, Jie, Xi, Yanxin, Zhou, Zhilun, Yuan, Yuan, Wang, Shengyuan, Zeng, Qingbin, Li, Songwei, Zhang, Yunke, Lin, Yuming, Li, Tong, Ding, Jingtao, Gao, Chen, Xu, Fengli, Li, Yong

arXiv.org Artificial Intelligence

Over the past year, the development of large language models (LLMs) has brought spatial intelligence into focus, with much attention on vision-based embodied intelligence. However, spatial intelligence spans a broader range of disciplines and scales, from navigation and urban planning to remote sensing and earth science. What are the differences and connections between spatial intelligence across these fields? In this paper, we first review human spatial cognition and its implications for spatial intelligence in LLMs. We then examine spatial memory, knowledge representations, and abstract reasoning in LLMs, highlighting their roles and connections. Finally, we analyze spatial intelligence across scales -- from embodied to urban and global levels -- following a framework that progresses from spatial memory and understanding to spatial reasoning and intelligence. Through this survey, we aim to provide insights into interdisciplinary spatial intelligence research and inspire future studies.


Does Spatial Cognition Emerge in Frontier Models?

Ramakrishnan, Santhosh Kumar, Wijmans, Erik, Kraehenbuehl, Philipp, Koltun, Vladlen

arXiv.org Artificial Intelligence

Not yet. We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Our benchmark builds on decades of research in cognitive science. It evaluates large-scale mapping abilities that are brought to bear when an organism traverses physical environments, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory. For many tasks, we instantiate parallel presentations via text and images, allowing us to benchmark both large language models and large multimodal models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals, performing near chance level on a number of classic tests of animal cognition.


Failures in Perspective-taking of Multimodal AI Systems

Leonard, Bridget, Woodard, Kristin, Murray, Scott O.

arXiv.org Artificial Intelligence

This study extends previous research on spatial representations in multimodal AI systems. Although current models demonstrate a rich understanding of spatial information from images, this information is rooted in propositional representations, which differ from the analog representations employed in human and animal spatial cognition. To further explore these limitations, we apply techniques from cognitive and developmental science to assess the perspective-taking abilities of GPT-4o. Our analysis enables a comparison between the cognitive development of the human brain and that of multimodal AI, offering guidance for future research and model development.